11. Phi Accrual Failure Detection

Let's learn about Phi Accrual Failure Detection algorithm and its usage.

We'll cover the following

Background
Definition
Solution
Examples

Background#

In distributed systems, accurately detecting failures is a hard problem to solve, as we cannot say with 100% surety if a system is genuinely down or is just very slow in responding due to heavy load, network congestion, etc. Conventional failure detection mechanisms like Heartbeating outputs a boolean value telling us if the system is alive or not; there is no middle ground. Heartbeating uses a fixed timeout, and if there is no heartbeat from a server, the system, after the timeout assumes that the server has crashed. Here, the value of the timeout is critical. If we keep the timeout short, the system will detect failures quickly but with many false positives due to slow machines or faulty network. On the other hand, if we keep the timeout long, the false positives will be reduced, but the system will not perform efficiently for being slow in detecting failures.

Definition#

Use adaptive failure detection algorithm as described by Phi Accrual Failure Detector. Accrual means accumulation or the act of accumulating over time. This algorithm uses historical heartbeat information to make the threshold adaptive. Instead of telling if the server is alive or not, a generic Accrual Failure Detector outputs the suspicion level about a server. A higher suspicion level means there are higher chances that the server is down.

Solution#

With Phi Accrual Failure Detector, if a node does not respond, its suspicion level is increased and could be declared dead later. As a node’s suspicion level increases, the system can gradually stop sending new requests to it. Phi Accrual Failure Detector makes a distributed system efficient as it takes into account fluctuations in the network environment and other intermittent server issues before declaring a system completely dead.

Examples#

Cassandra uses the Phi Accrual Failure Detector algorithm to determine the state of the nodes in the cluster.

10. Gossip Protocol

12. Split Brain

Mark as Completed

Report an Issue

Introduction

Summarized System Design Problems

Dynamo: How To Design a Key-Value Store

Cassandra: How to Design a Wide-Column NoSQL Database

Kafka: How to Design a Distributed Messaging System

Chubby: How to Design a Distributed Locking Service

GFS: How to Design a Distributed File System

HDFS: How to Design a Distributed File System

BigTable: How to Design a Wide-Column Storage System

System Design Patterns

Glossary of System Design Basics

Final Assessments

11. Phi Accrual Failure Detection

Background#

Definition#

Solution#

Examples#